skip to main content


Search for: All records

Creators/Authors contains: "Lin, Jinfeng"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Taxonomies serve many applications with a structural representation of knowledge. To incorporate emerging concepts into existing taxonomies, the task of taxonomy completion aims to find suitable positions for emerging query concepts. Previous work captured homogeneous token-level interactions inside a concatenation of the query concept term and definition using pre-trained language mod- els. However, they ignored the token-level interactions between the term and definition of the query concepts and their related concepts. In this work, we propose to capture heterogeneous token-level interactions between the different textual components of concepts that have different types of relations. We design a relation-aware mutual attention module (RAMA) to learn such interactions for taxonomy completion. Experimental results demonstrate that our new taxonomy completion framework based on RAMA achieves the state-of-the-art performance on six taxonomy datasets. 
    more » « less
    Free, publicly-accessible full text available July 1, 2024
  2. Recent breakthroughs in deep-learning (DL) approaches have resulted in the dynamic generation of trace links that are far more accurate than was previously possible. However, DL-generated links lack clear explanations, and therefore non-experts in the domain can find it difficult to understand the underlying semantics of the link, making it hard for them to evaluate the link's correctness or suitability for a specific software engineering task. In this paper we present a novel NLP pipeline for generating and visualizing trace link explanations. Our approach identifies domain-specific concepts, retrieves a corpus of concept-related sentences, mines concept definitions and usage examples, and identifies relations between cross-artifact concepts in order to explain the links. It applies a post-processing step to prioritize the most likely acronyms and definitions and to eliminate non-relevant ones. We evaluate our approach using project artifacts from three different domains of interstellar telescopes, positive train control, and electronic healthcare systems, and then report coverage, correctness, and potential utility of the generated definitions. We design and utilize an explanation interface which leverages concept definitions and relations to visualize and explain trace link rationales, and we report results from a user study that was conducted to evaluate the effectiveness of the explanation interface. Results show that the explanations presented in the interface helped non-experts to understand the underlying semantics of a trace link and improved their ability to vet the correctness of the link. 
    more » « less
  3. null (Ed.)
    Software traceability establishes and leverages associations between diverse development artifacts. Researchers have proposed the use of deep learning trace models to link natural language artifacts, such as requirements and issue descriptions, to source code; however, their effectiveness has been restricted by availability of labeled data and efficiency at runtime. In this study, we propose a novel framework called Trace BERT (T-BERT) to generate trace links between source code and natural language artifacts. To address data sparsity, we leverage a three-step training strategy to enable trace models to transfer knowledge from a closely related Software Engineering challenge, which has a rich dataset, to produce trace links with much higher accuracy than has previously been achieved. We then apply the T-BERT framework to recover links between issues and commits in Open Source Projects. We comparatively evaluated accuracy and efficiency of three BERT architectures. Results show that a Single-BERT architecture generated the most accurate links, while a Siamese-BERT architecture produced comparable results with significantly less execution time. Furthermore, by learning and transferring knowledge, all three models in the framework outperform classical IR trace models. On the three evaluated real-word OSS projects, the best T-BERT stably outperformed the VSM model with average improvements of 60.31% measured using Mean Average Precision (MAP). RNN severely underperformed on these projects due to insufficient training data, while T-BERT overcame this problem by using pretrained language models and transfer learning. 
    more » « less
  4. Automatic construction of a taxonomy supports many applications in e-commerce, web search, and question answering. Existing taxonomy expansion or completion methods assume that new concepts have been accurately extracted and their embedding vectors learned from the text corpus. However, one critical and fundamental challenge in fixing the incompleteness of taxonomies is the incompleteness of the extracted concepts, especially for those whose names have multiple words and consequently low frequency in the corpus. To resolve the limitations of extraction-based methods, we propose GenTaxo to enhance taxonomy completion by identifying positions in existing taxonomies that need new concepts and then generating appropriate concept names. Instead of relying on the corpus for concept embeddings, GenTaxo learns the contextual embeddings from their surrounding graph-based and language-based relational information, and leverages the corpus for pre-training a concept name generator. Experimental results demonstrate that GenTaxo improves the completeness of taxonomies over existing methods. 
    more » « less
  5. Software traceability provides support for various engineering activities including Program Comprehension; however, it can be challenging and arduous to complete in large industrial projects. Researchers have proposed automated traceability techniques to create, maintain and leverage trace links. Computationally intensive techniques, such as repository mining and deep learning, have showed the capability to deliver accurate trace links. The objective of achieving trusted, automated tracing techniques at industrial scale has not yet been successfully accomplished due to practical performance challenges. This paper evaluates high-performance solutions for deploying effective, computationally expensive traceability algorithms in large scale industrial projects and leverages generated trace links to answer Program Comprehension Queries. We comparatively evaluate four different platforms for supporting industrial-scale tracing solutions, capable of tackling software projects with millions of artifacts. We demonstrate that tracing solutions built using big data frameworks scale well for large projects and that our Spark implementation outperforms relational database, graph database (GraphDB), and plain Java implementations. These findings contradict earlier results which suggested that GraphDB solutions should be adopted for large-scale tracing problems. 
    more » « less
  6. Software traceability establishes associations between diverse software artifacts such as requirements, design, code, and test cases. Due to the non-trivial costs of manually creating and maintaining links, many researchers have proposed automated approaches based on information retrieval techniques. However, many globally distributed software projects produce software artifacts written in two or more languages. The use of intermingled languages reduces the efficacy of automated tracing solutions. In this paper, we first analyze and discuss patterns of intermingled language use across multiple projects, and then evaluate several different tracing algorithms including the Vector Space Model (VSM), Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and various models that combine mono- and cross-lingual word embeddings with the Generative Vector Space Model (GVSM). Based on an analysis of 14 Chinese-English projects, our results show that best performance is achieved using mono-lingual word embeddings integrated into GVSM with machine translation as a preprocessing step. 
    more » « less
  7. In many regulated domains, traceability is established across diverse artifacts such as requirements, design, code, test cases, and hazards -- either manually or with the help of supporting tools, and the resulting trace links are used to support activities such as impact analysis, compliance verification, and safety inspections. Automated tracing techniques need to leverage the semantics of underlying artifacts in order to establish more accurate trace links and to provide explanations of links that have been created in either a manual or automated fashion. To support this, we propose an automated technique which leverages source code, project artifacts and an external domain corpus to generate a domain-specific concept model. We then use the generated concept model to improve traceability results and to provide explanations of the results. Our approach overcomes existing problems with deep-learning traceability algorithms, as it does not require a training set of existing trace links. Finally, as an initial proof-of-concept, we apply our semantically-guided approach to the Dronology project, and show that it improves over other tracing techniques that do not use a concept model. 
    more » « less
  8. Software projects produce large quantities of data such as feature requests, requirements, design artifacts, source code, tests, safety cases, release plans, and bug reports. If leveraged effectively, this data can be used to provide project intelligence that supports diverse software engineering activities such as release planning, impact analysis, and software analytics. However, project stakeholders often lack skills to formulate complex queries needed to retrieve, manipulate, and display the data in meaningful ways. To address these challenges we introduce TiQi, a natural language interface, which allows users to express software-related queries verbally or written in natural language. TiQi is a web-based tool. It visualizes available project data as a prompt to the user, accepts Natural Language (NL) queries, transforms those queries into SQL, and then executes the queries against a centralized or distributed database. Raw data is stored either directly in the database or retrieved dynamically at runtime from case tools and repositories such as Github and Jira. The transformed query is visualized back to the user as SQL and augmented UML, and raw data results are returned. Our tool demo can be found on YouTube at the following link:http://tinyurl.com/TIQIDemo. 
    more » « less